In this workshop you will get practice in choosing between, performing, and presenting the results of one- and two- sample t-tests and their non-parametric equivalents in R.
By actively following the materials and carrying out the independent study before and after the contact hours the successful student will be able to:
Workshops are not a test. It is expected that you often donβt know how to start, make a lot of mistakes and need help. Do not be put off and donβt let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the independent study materials for reference.
Materials are indexed here: https://3mmarand.github.io/BIO00017C-Data-Analysis-in-R-2020/
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
is something you need to do on your computer. It may be opening programs or documents or locating a file.
is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
is question for you to think about an answer. You will usually want to record your answers in your script for future reference.
Artwork by Allison Horst
Start RStudio from the Start menu.
Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project and then New Directory, then New Project. Navigate to the βdata-analysis-in-rβ folder. Name the RStudio Project βworkshop4β.
On the Files tab click on New Folder. In the box that appears type βdataβ. This will be the folder that we save data files to.
Make a new script then save it with a name like analysis.R to carry out the rest of the work.
Load the
tidyverse:
library(tidyverse)Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in adipocytes.txt. Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not.
Save a copy of the data file adipocytes.txt
Read in the data and check the structure. I used the name
adip for the dataframe/tibble.
#---CODING ANSWER---
# import
adip <- read_table2("data/adipocytes.txt")
str(adip)We have a tibble containing two variables: adiponectin is the response and is continuous and treatment is explanatory. treatment is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use geom_point(), geom_boxplot() or a variety of other geoms. I often use geom_violin() which allows us to see the distribution - the violin is fatter where there are more data points.
Do a quick plot of the data:
ggplot(data = adip, aes(x = treatment, y = adiponectin)) +
geom_violin()Top Tip
Always do a quick data visualisation before you do any analysis.
Summarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors.
Create a data frame called
adipsummary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. You may want to look this up in your script from the second and third workshops.
You should get the following numbers:
| treatment | mean | std | n | se |
|---|---|---|---|---|
| control | 5.546000 | 1.475247 | 15 | 0.3809072 |
| nicotinic | 7.508667 | 1.793898 | 15 | 0.4631824 |
Do you think this is a paired-sample test or two-sample test?
You can carry out a two-sample t-test with:
t.test(data = adip,
adiponectin ~ treatment,
var.equal = T)##
## Two Sample t-test
##
## data: adiponectin by treatment
## t = -3.2728, df = 28, p-value = 0.00283
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.1910762 -0.7342571
## sample estimates:
## mean in group control mean in group nicotinic
## 5.546000 7.508667
What do you conclude from the test? Write your conclusion in a form suitable for a report.
The assumptions of the two-sample t-test are that the residuals β the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we need to calculate the residuals and we can do this in two steps:
First by adding a column that holds the mean for the group each value belongs to by βmergingβ the first two columns (
treatment and mean) of the summary data into the raw data:
# add the group means to the data
adip <- merge(adip, adipsummary[1:2], by = "treatment")adipsummary[1:2] is the first two columns of adipsummary: we want to add the mean and need the treatment to match the rows. We donβt need to add the other columns.
Every row where treatment in the adip dataframe matches treatment in the adipsummary dataframe, the mean, std column of adipsummary, is added to adip.
look at the
adip dataframe to see what this code has done.
Second by adding a column for each βresidualβ
# add the residuals
adip <- adip %>%
mutate(residual = adiponectin - mean)Now we are ready to examine the distribution of the residuals.
Check the residuals are homogenously distributed (variance is the same in both groups):
ggplot(data = adip,
aes(x = mean, y = residual)) +
geom_point()Thereβs a bit of an outlier in one group but this looks βokβ.
Check the residuals are noramlly distributed:
ggplot(data = adip,
aes(x = residual)) +
geom_histogram(bins = 10)This also looks ok.
We can also use a Shapiro-Wilk normality test. The null hypothesis of this test is that the residuals follow a normal distribution.
Run a normality test:
shapiro.test(adip$residual)##
## Shapiro-Wilk normality test
##
## data: adip$residual
## W = 0.97562, p-value = 0.701
Usually, when we are doing statistical tests we would like the the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the t-test assumptions are met.
What do you conclude from the result of the normality tests?
We are going to create a figure like this:
In this figure, we have the data points themselves which are in adip dataframe and the means and standard errors which are in the adipsummary dataframe. That is, we have two dataframes we want to plot. Here you will learn that dataframes and aesthetics can be specified within a geom_... (rather than in the ggplot()) if the geom only applies to some of the data you want to plot.
We will build the plot up in small steps but as you get more used to ggplot youβll probably be able to create figures in fewer steps.
You can either make a new plot each time OR edit your existing ggplot() command as we go.
First, create an empty plot:
ggplot() Now add the data points:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin))Notice how we have given the data argument and the aesthetics inside the geom. The variables treatment and adiponectin are in the adip dataframe
So the data points donβt overlap, we can add some random jitter in the x direction:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0))Weβve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (adiponectin).
Letβs make the points a light grey:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50")Now to add the errorbars. These go from one standard error below the mean to one standard error above the mean.
Add a
geom_errorbar() for errorbars:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) We have specified the adipsummary dataframe and the variables treatment, mean and se are in that.
There are several ways you could add the mean. You could use geom_point() but I like to use geom_errorbar() again with the ymin and ymax both set to the mean.
Add a
geom_errorbar() for the mean:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) Alter the axis labels and limits:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid")) Format the figure in a way that is more suitable for including in a report:
ggplot() +
geom_point(data = adip, aes(x = treatment, y = adiponectin),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = adipsummary,
aes(x = treatment, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Adiponectin (pg/mL)",
limits = c(0, 12),
expand = c(0, 0)) +
theme_classic()Top Tip
The code required to summarise, test, and plot data for any two-sample t-test AND for any for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it. Consider making a script called ttest.R or similar with all the code and information you need to reuse it.
Later in the module, youβll learn how to put annotation on the figure like this:
These data come from a sample of grouse shot in Scotland. The grouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates βGordonβ and βMossβ. We want to know if the two estates have different infection rates.
gordon: 5, 16, 8, 64, 51, 11, 9, 7, 43, 49
moss: 0, 2, 1, 3, 6, 10, 4, 12, 19, 20
Create dataframe for these data. I put it in dataframe in two columns (which allowed me to copy and paste the values) then used
pivot_longer()
You are aim for this:
Using your common sense, do these data look normally distributed?
What test do you suggest?
Summarise the data by finding the median of each group:
Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney):
wilcox.test(data = grouse, nematodes ~ estate)##
## Wilcoxon rank sum exact test
##
## data: nematodes by estate
## W = 78, p-value = 0.03546
## alternative hypothesis: true location shift is not equal to 0
What do you conclude from the test? Write your conclusion in a form suitable for a report.
A box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range.
We can create a simple boxplot with:
ggplot(data = grouse, aes(x = estate, y = nematodes) ) +
geom_boxplot() Format the figure so it is more suitable for a report.
Researchers are interested in the expression levels of a particular set of 35 E.coli genes in response to heat stress. They measure the expression of the genes at 37 and 42 degrees C (labelled low and high temperature). These samples are not independent because we might expect there to be a relationship between expression at 37 and 42 degrees within a gene.
What is the null hypothesis?
Save a copy of coliexp.txt
Read the data in
What is the appropriate parametric test?
Now carry out a paired-sample t-test.
t.test(data = coliexp, expression ~ temperature, paired = T)##
## Paired t-test
##
## data: expression by temperature
## t = 3.2039, df = 34, p-value = 0.002943
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1123712 0.5021946
## sample estimates:
## mean of the differences
## 0.3072829
State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect.
Decide which test you should use to analyse the each following data sets. In each case give the reasons for your choice of test and state the null hypothesis. Write your conclusions in a form suitable for including in a report. Can you make figures?
Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. Do you think their modification has been successful?
In order to investigate the effects of feeding fertilised grass to sheep, one of each pair of fourteen sets of twins was fed fertilised grass whilst the other was fed unfertilised grass and the adult weight of the sheep was recorded. The data are in sheep.txt . Is there difference in the effect of fertilised and unfertilised grass on sheep weight?
These contain all the code needed in the workshop even where it is not visible on the webpage.
Rmd file The Rmd file is the file I use to compile the practical. Rmd stands for R markdown. It allows R code and ordinary text to be interweaved to produce well-formatted reports including webpages. If you right-click on the link and choose Save-As, you will be able to open the Rmd file in RStudio. Alternatively, View in Browser.
Plain script file This is plain script (.R) version of the practical generated from the Rmd. Again, you can save this and open it RStudio. Alternatively, View in Browser.
Pages made with rmarkdown (Allaire Xie, et al., 2019a; Xie Allaire, et al., 2018a), kableExtra (Zhu, 2019a), RefManager (McLean, 2014)
Allaire, J., Y. Xie, et al. (2019a). rmarkdown: Dynamic Documents for R. R package version 1.16. URL: https://github.com/rstudio/rmarkdown.
McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManager Package. arXiv: 1403.2036 [cs.DL]. URL: https://arxiv.org/abs/1403.2036.
Xie, Y., J. Allaire, et al. (2018a). R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.
Zhu, H. (2019a). kableExtra: Construct Complex Table with βkableβ and Pipe Syntax. R package version 1.1.0. URL: https://CRAN.R-project.org/package=kableExtra.